Contents

Loading in the Data
Aire Trace Processing
Control Cell Processing
Make Figures

This is a document that can be used to reproduce all analysis from the (eventual paper). While all steps are included, some steps take a long time and it is faster to just load in the provided R object.

For best reproducability, run this analysis using the Snakemake pipeline provided as package differences can make minor changes to clustering and dimensionality reduction.

If you would like to skip the data processing and just make figures from pre-made objects, go here

All analysis depends on loading the mTEC.10x.pipeline r package

library(mTEC.10x.pipeline)
## Warning: replacing previous import 'dplyr::combine' by 'gridExtra::combine'
## when loading 'mTEC.10x.pipeline'

Loading in the data

In the Snakemake pipeline, this correponds to rule seurat_object and runs the script create_seurat.R

All data is provided in the form of empty R Seurat objects that were made by running the command

mtec_data <- Seurat::Read10X(data.dir = sequence_dir)
mtec <- CreateSeuratObject(raw.data = mtec_data, min.cells = 3, min.genes = 200,
  project = project_name)

where “sequence_dir” is the path to the output of cellranger count and project_name = “RankL_ablation”

Aire trace processing

In the Snakemake pipeline, this correponds to rule final_analysis and runs the script analysis_driver.R found in the aireTrace/scripts directory

This is the processing that was done to create the aireTrace Seurat object. If you would rather skip this processing, skip to figures section.

load in necessary packages

library(Seurat)
## Loading required package: ggplot2
## Loading required package: cowplot
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
## 
##     ggsave
## Loading required package: Matrix
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(scran)
## Loading required package: BiocParallel
## Loading required package: SingleCellExperiment
## Loading required package: SummarizedExperiment
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:Matrix':
## 
##     colMeans, colSums, rowMeans, rowSums, which
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following object is masked from 'package:Matrix':
## 
##     expand
## The following object is masked from 'package:base':
## 
##     expand.grid
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## The following object is masked from 'package:dplyr':
## 
##     count
## 
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
## 
##     aperm, apply
library(DDRTree)
## Loading required package: irlba

load in empty aireTrace object and gene lists

# Change this to load from the mtec.10x.data package
load("/home/kwells4/mTEC_dev/mtec_snakemake/aireTrace/analysis_outs/seurat_aireTrace_empty.rda")
data_dir <- "/home/kwells4/mTEC_dev/data/"
load(paste0(data_dir, "TFs.rda"))
load(paste0(data_dir, "gene_to_ensembl.rda"))

Set plotting colors

stage_color_df <- data.frame("Cortico_medullary" = "#CC6600",
                             "Ccl21a_high" = "#009933",
                             "Early_Aire" = "#0066CC",
                             "Aire_positive" = "#660099",
                             "Late_Aire" = "#FF0000",
                             "Tuft" = "#990000",
                             "unknown" = "#666666")

stage_color <- t(stage_color_df)[ , 1]

stage_levels <- c("Cortico_medullary", "Ccl21a_high", "Early_Aire",
                  "Aire_positive", "Late_Aire", "Tuft", "unknown")

Initial processing. This is run through the mTEC.10x.pipeline package, but combines the initial processing steps recommended by Seurat

# Add mitochondiral percent to the meta data
mtec <- add_perc_mito(mtec)

# Plot quality plots
qc_plot(mtec)

# remove low quality cells, normalize and scale data, find variable genes, and perform PCA 
mtec <- process_cells(mtec)
## Scaling data matrix

## [1] "PC1"
## [1] "Krt5"   "Igfbp4" "Gsn"    "Krt14"  "Gas1"  
## [1] ""
## [1] "Cyba"  "Cldn7" "Cd52"  "Cox17" "Csn2" 
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "Srgn"    "S100a16" "Ubd"     "Aire"    "S100a14"
## [1] ""
## [1] "Rgs13"   "Ly6g6f"  "Alox5ap" "Lrmp"    "Avil"   
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "Spink5"   "Pdzk1ip1" "Oit1"     "Lypd3"    "Gsta4"   
## [1] ""
## [1] "Tmsb10" "Vim"    "Phgdh"  "Top2a"  "Birc5" 
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "Selplg"  "Samsn1"  "Zmynd15" "Ccr7"    "Cytip"  
## [1] ""
## [1] "Gm13305" "S100a16" "Il11ra1" "Fezf2"   "Lrrc42" 
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "Birc5"  "Ccna2"  "Cdca8"  "Nusap1" "Stmn1" 
## [1] ""
## [1] "Mia"    "Sparc"  "Stkld1" "Mgp"    "Gpx3"  
## [1] ""
## [1] ""
# Plot a PCA and determine dimensions to use by looking at heatmaps and elbow plots
PC_plots(mtec, jackstraw = FALSE, test_pcs = 1:20)

If running on a cluster and if you have the time, you can make a jackstraw plot to test the number of PCs to keep. Run using:

PC_plots(mtec, jackstraw = TRUE, test_pcs = 1:20)

After determining the number of PCs to use from the PC_plots outputs, clustering and dimensional reduction can be performed using the appropriate number of PCs. Here, we used the first 12 PCs.

# Decide PCs based on output of jackstraw plot
mtec <- group_cells(mtec, dims_use = 1:12)

Now that we have clusters, we can find marker genes of each cluster to be able to identify what cell types are contained in each cluster.

mtec.markers <- FindAllMarkers(object = mtec, only.pos = TRUE, min.pct = 0.25,
  thresh.use = 0.25)

# Only print the top 10 markers of each cluster

top10 <- mtec.markers %>% group_by(cluster) %>% top_n(10, avg_logFC)

print(top10)
## # A tibble: 90 x 7
## # Groups:   cluster [9]
##       p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene  
##       <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr> 
##  1 4.53e-69     0.986 1     0.896  8.18e-65 0       Apoe  
##  2 6.58e-68     1.00  0.957 0.661  1.19e-63 0       C3    
##  3 1.17e-66     0.880 1     0.869  2.11e-62 0       Krt14 
##  4 1.89e-62     0.832 0.983 0.771  3.42e-58 0       Gsn   
##  5 9.24e-61     1.01  0.954 0.662  1.67e-56 0       Igfbp4
##  6 8.89e-58     0.844 0.844 0.431  1.61e-53 0       Col6a2
##  7 6.00e-57     1.10  0.877 0.536  1.08e-52 0       Gpx3  
##  8 1.10e-52     0.833 0.868 0.542  1.98e-48 0       Col6a1
##  9 8.33e-41     0.925 0.805 0.419  1.51e-36 0       Tagln 
## 10 1.13e-28     0.992 0.702 0.432  2.03e-24 0       Acta2 
## # … with 80 more rows
DoHeatmap(object = mtec, genes.use = top10$gene, slim.col.label = TRUE, remove.key = TRUE)

Use the gene list to name clusters. We chose to merge clusters 0, 1, and 2 because of their highly similar gene expression patterns, especially levels of Ccl21a expression.

stage_list <- c("0" = "Ccl21a_high", "1" = "Ccl21a_high", "2" = "Ccl21a_high",
                "3" = "Late_Aire", "4" = "Cortico_medullary",
                "5" = "Early_Aire", "6" = "Aire_positive",
                "7" = "Tuft", "8" = "unknown")

mtec <- set_stage(mtec, stage_list)

mtec@meta.data$stage <- factor(mtec@meta.data$stage,
                               levels = stage_levels)

Now we can find markers of each stage of development and determine cell cycle phase

# These steps take a long time. Only run if you have processing power.
mtec <- Seurat::StashIdent(mtec, save.name = "seurat_cluster")

mtec <- Seurat::SetAllIdent(mtec, id = "stage")

mtec@ident <- factor(mtec@ident, levels = stage_levels)

mtec <- significant_markers(mtec)

mtec <- run_cyclone(mtec, gene_to_ensembl)

Check the data visually

load("/home/kwells4/mTEC_dev/mtec_snakemake/aireTrace/analysis_outs/seurat_aireTrace.rda")

tSNE_PCA(mtec, "seurat_cluster", PCA = TRUE)

## $umap

## 
## $pca

tSNE_PCA(mtec, "cluster", PCA = TRUE, color = stage_color)
## Warning: Removed 36 rows containing missing values (geom_point).

## Warning: Removed 36 rows containing missing values (geom_point).

## $umap
## Warning: Removed 36 rows containing missing values (geom_point).

## 
## $pca
## Warning: Removed 36 rows containing missing values (geom_point).

tSNE_PCA(mtec, "Aire")

## $umap

tSNE_PCA(mtec, "Ccl21a")

## $umap

tSNE_PCA(mtec, "Mki67")

## $umap

tSNE_PCA(mtec, "Ascl1")

## $umap

tSNE_PCA(mtec, "Hmgb2")

## $umap

tSNE_PCA(mtec, "Dclk1")

## $umap

tSNE_PCA(mtec, "Fezf2")

## $umap

tSNE_PCA(mtec, "cycle_phase", color = c("black", "red", "purple"))

## $umap

Control cell processing

Combine control samples

In the Snakemake pipeline, this correponds to rule combine_controls and runs the script create_combined_seurat.R found in the controls/scripts directory

Analyze control samples

In the Snakemake pipeline, this correponds to rule combine_controls and runs the script analysis_driver.R found in the controls/scripts directory

Making the paper figures

In the Snakemake pipeline, this correponds to rule combine_controls and runs the script figures.R found in the scripts directory

If you haven’t followed the steps outlined above to pre-process the data and would just like to make figures using the final seurat objects, run this command

# Change this to load from mTEC.10x.data
load("/home/kwells4/mTEC_dev/mtec_snakemake/aireTrace/analysis_outs/seurat_aireTrace.rda")

This loads in complete Seurat objects that were created using the Snakemake pipeline.